You can access the course materials quickly from
Fundamental Techniques in Data Science with R
| Week # | Topic | R-practical | Workgroup |
|---|---|---|---|
| 1 | The elemental building blocks of R | Assigning objects and elements; creating vectors, matrices, dataframes and lists | Receive instructions and form groups |
| 2 | Finding the least squares solution; simple linear regression | Subsetting data; using pipes to simplify the workflow | Locate a data set for predictive modeling and formulate a research hypothesis; make sure that the set facilitates continuous and dichotomous outcomes |
| 3 | Linear modeling in R; testing assumptions; standardized residuals, leverage and Cook’s distance | Class lm in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met |
| 4 | Inferential modeling; Confidence intervals and hypothesis testing, non-constant error variance | Demonstrate confidence validity of the linear model on simulated data with rmarkdown | Test and quantify the effect of the defined model; continue the project in rmarkdown |
| 5 | Model evaluation; cross-validation; categorical variables, non-linear relations, interactions and higher-order polynomials | Cross-validation and model fit in R | Evaluate if the model can be improved; Prepare assignment A; evaluate the final linear model on your own data |
| Week # | Topic | R-practical | Workgroup |
|---|---|---|---|
| 6 | Simple logistic regression | Class glm(formula, family = "binomial") in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met |
| 7 | Formulating the logistic model and interpreting the parameters; marginal effects | Parameter transformations; scale of the predictor/outcome and prediction and confidence intervals | Test and quantify the effect of the defined model |
| 8 | Logistic regression model evaluation; cross-validation; multiple regression; interactions | Multiple logistic regression and cross-validating the logistic regression in R | Evaluate if the model can be improved; Prepare assignment B; evaluate the final logistic model on your own data |
This means that you will learn the ins and outs of inferential and predictive research with linear and logistic models.
We use R to perform our data analysis and visualizations
Learn to keep your cool and build the foundation for a successful scripting career in predictive and inferential analytics
R is a language and environment for statistical computing and for graphics
GNU project (100% free software)
Managed by the R Foundation for Statistical Computing, Vienna, Austria.
Community-driven
Based on the object-oriented language S (1975)
R works with objects that consist of elements. The smallest elements are numbers and characters.
Everything that is published on the Comprehensive R Archive Network (CRAN) and is aimed at R users, must be accompanied by a help file.
If you know the name of the function, e.g. anova(), then you just type ?anova or help(anova) in the console.
If you do not know the name of the function: type ?? followed by your search criterion. For example, ??anova returns a list of all help pages that contain the word ‘anova’.
Google can be of tremendous help.
For R related issues, use ‘R:’ as a prefix in your search term.
Assigning things in R is very straightforward: you use <-
For example, if you assign the value 100 (an element) to object a, you would type
a <- 100
Calling things in R is also very straightforward:
For example, we assigned the value 100 to object a. To call object a, we would type
a
## [1] 100
This is why we use RStudio.
a <- c(1, 2, 3, 4, 5)
a
## [1] 1 2 3 4 5
b <- 1:5
b
## [1] 1 2 3 4 5
Characters (or character strings) in R are indicated by the double quote identifier.
a.new <- c(a, "A")
a.new
## [1] "1" "2" "3" "4" "5" "A"
Notice the difference with a from the previous slide
a
## [1] 1 2 3 4 5
rep(a, 15)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [36] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [71] 1 2 3 4 5
If we would want just the third element, we would type
a[3]
## [1] 3
This we would refer to as a matrix
c <- matrix(a, nrow = 5, ncol = 2)
c
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5
c[1, ]
## [1] 1 1
c[, 2]
## [1] 1 2 3 4 5
c[1, 2]
## [1] 1
In short; square brackets [] are used to call elements, rows, columns (and much more beyond the scope of this course)
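A few more indexing patterns, as a small sketch. The objects v and m are illustrative names introduced here, not objects from the slides.

```r
# Illustrative objects (hypothetical names, chosen so that a and c stay untouched)
v <- c(10, 20, 30, 40, 50)
m <- matrix(1:10, nrow = 5, ncol = 2)

v[c(1, 3)]            # first and third element: 10 30
v[-1]                 # everything except the first element: 20 30 40 50
m[2:3, ]              # rows 2 and 3, all columns
m[, 2, drop = FALSE]  # second column, kept as a 5 x 1 matrix instead of a vector
```

By default a single row or column is returned as a vector; drop = FALSE keeps the matrix structure.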
If we add a character column to matrix c; everything becomes a character:
cbind(c, letters[1:5])
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e" 
Alternatively,
cbind(c, c("a", "b", "c", "d", "e"))
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e" 
Remember, matrices and vectors are numerical OR character objects. They can never contain both and still be used for numerical calculations.
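The coercion can be checked with class(). The following is a minimal sketch; the object names x, y and z are illustrative.

```r
x <- c(1, 2, 3)
y <- c(x, "A")   # adding a character element coerces everything to character

class(x)   # "numeric"
class(y)   # "character"

# Converting back only works for elements that look like numbers;
# "A" becomes NA (R warns about this, the warning is suppressed here)
z <- suppressWarnings(as.numeric(y))
z
```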
d <- data.frame("V1" = rnorm(5),
                "V2" = rnorm(5, mean = 5, sd = 2),
                "V3" = letters[1:5])
d
##            V1       V2 V3
## 1  0.08277864 5.849016  a
## 2  0.54542935 3.292187  b
## 3  0.19208169 2.294143  c
## 4 -0.65761705 2.523033  d
## 5 -0.85008235 4.005873  e
We ‘filled’ a dataframe with two randomly generated sets from the normal distribution - where \(V1\) is standard normal and \(V2 \sim N(5, 2^2)\), i.e. mean 5 and standard deviation 2 - and a character set.
Data frames can contain both numerical and character elements at the same time, although never in the same column.
You can name the columns and rows in data frames (just like in matrices)
row.names(d) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
d
##                V1       V2 V3
## row 1  0.08277864 5.849016  a
## row 2  0.54542935 3.292187  b
## row 3  0.19208169 2.294143  c
## row 4 -0.65761705 2.523033  d
## row 5 -0.85008235 4.005873  e
There are two ways to obtain row 3 from data frame d:
d["row 3", ]
##              V1       V2 V3
## row 3 0.1920817 2.294143  c
and
d[3, ]
##              V1       V2 V3
## row 3 0.1920817 2.294143  c
The intersection between row 2 and column 3 can be obtained by
d[2, 3]
## [1] b
## Levels: a b c d e
Both
d[, "V2"] # and
## [1] 5.849016 3.292187 2.294143 2.523033 4.005873
d[, 2]
## [1] 5.849016 3.292187 2.294143 2.523033 4.005873
yield the second column. But we can also use $ to call variable names in data frame objects
d$V2
## [1] 5.849016 3.292187 2.294143 2.523033 4.005873
If you wish to use numerical objects that have more than two dimensions, an array would be a suitable object. The following code yields a 3-dimensional array (2 rows, 4 columns and 3 matrices):
e <- array(1:24, dim = c(2, 4, 3))
e
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]    9   11   13   15
## [2,]   10   12   14   16
## 
## , , 3
## 
##      [,1] [,2] [,3] [,4]
## [1,]   17   19   21   23
## [2,]   18   20   22   24
The square bracket identification works similarly to the identification of matrices and dataframes, but with the added dimension(s). For example,
e[1, 3, 2]
## [1] 13
yields the element in the first row of the third column in the second matrix. This is exactly the downside to an array: it is a series of matrices.
In other words, characters and numerical elements may not be mixed.
If we replace the third matrix in the array by a character version of that matrix, we obtain
e[, , 3] <- as.character(e[, , 3])
e
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,] "1"  "3"  "5"  "7" 
## [2,] "2"  "4"  "6"  "8" 
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,] "9"  "11" "13" "15"
## [2,] "10" "12" "14" "16"
## 
## , , 3
## 
##      [,1] [,2] [,3] [,4]
## [1,] "17" "19" "21" "23"
## [2,] "18" "20" "22" "24"
Lists are just what they say they are: lists. You can have a list of everything mixed with everything. For example, a simple list can be created by
f <- list(a)
f
## [[1]]
## [1] 1 2 3 4 5
Elements or objects within lists can be called by using double square brackets [[]]. For example, the first (and only) element in list f is object a
f[[1]]
## [1] 1 2 3 4 5
We can simply add an object or element to an existing list
f[[2]] <- d
f
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
##                V1       V2 V3
## row 1  0.08277864 5.849016  a
## row 2  0.54542935 3.292187  b
## row 3  0.19208169 2.294143  c
## row 4 -0.65761705 2.523033  d
## row 5 -0.85008235 4.005873  e
to obtain a list with a vector and a data frame.
We can add names to the list as follows
names(f) <- c("vector", "data frame")
f
## $vector
## [1] 1 2 3 4 5
## 
## $`data frame`
##                V1       V2 V3
## row 1  0.08277864 5.849016  a
## row 2  0.54542935 3.292187  b
## row 3  0.19208169 2.294143  c
## row 4 -0.65761705 2.523033  d
## row 5 -0.85008235 4.005873  e
Calling the vector (a) from the list can be done as follows
f[[1]]
## [1] 1 2 3 4 5
f[["vector"]]
## [1] 1 2 3 4 5
f$vector
## [1] 1 2 3 4 5
Take the following example
g <- list(f, f)
To call the vector from the second list within the list g, use the following code
g[[2]][[1]]
## [1] 1 2 3 4 5
g[[2]]$vector
## [1] 1 2 3 4 5
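When lists are nested, str() gives a compact overview of what sits where. The sketch below uses the illustrative objects inner and outer (not from the slides) and also shows how single and double square brackets differ on a list.

```r
inner <- list(vector = 1:5, label = "example")
outer <- list(inner, inner)

str(outer)                # compact overview of the nested structure
outer[[2]][["vector"]]    # double brackets return the element itself: 1 2 3 4 5
class(outer[1])           # single brackets return a (sub)list: "list"
class(outer[[1]]$vector)  # "integer"
```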
Logical operators are signs that evaluate a statement, such as ==, <, >, <=, >=, and | (OR) as well as & (AND). Typing ! before a logical operator takes the complement of that action. There are more operations, but these are the most useful.
For example, if we would like elements out of matrix c that are larger than 3, we would type:
c[c > 3]
## [1] 4 5 4 5
c > 3
##       [,1]  [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE
The column values for TRUE may be of different length. A vector as a return is therefore more appropriate.
c[c < 3 | c > 3] #c smaller than 3 or larger than 3
## [1] 1 2 4 5 1 2 4 5
or
c[c != 3] #c not equal to 3
## [1] 1 2 4 5 1 2 4 5
c != 3 returns a matrix
##       [,1]  [,2]
## [1,]  TRUE  TRUE
## [2,]  TRUE  TRUE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE
Remember c?:
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5
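The logical operators can be combined, and the resulting TRUE/FALSE values can be counted or located. A small sketch, assuming a matrix m (illustrative name) with the same values as matrix c:

```r
m <- matrix(rep(1:5, 2), nrow = 5, ncol = 2)  # same values as matrix c

m[!(m > 3)]       # complement of 'larger than 3': 1 2 3 1 2 3
m[m > 1 & m < 5]  # AND, both conditions must hold: 2 3 4 2 3 4
which(m > 3)      # positions of the TRUE values, counted column by column: 4 5 9 10
sum(m > 3)        # TRUE counts as 1, so this counts the matches: 4
```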
0 / 0
## [1] NaN
mean(c(1, 2, NA, 4, 5))
## [1] NA
There are two easy ways to perform “listwise deletion”:
mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
## [1] 3
mean(na.omit(c(1, 2, NA, 4, 5)))
## [1] 3
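Missing values can also be located before they are removed. The sketch below uses illustrative objects (x and df are not from the slides) and shows that, for a data frame, na.omit() drops every row that contains at least one NA.

```r
x <- c(1, 2, NA, 4, 5)
is.na(x)       # FALSE FALSE TRUE FALSE FALSE
sum(is.na(x))  # number of missing values: 1

# For a data frame, listwise deletion removes each incomplete row
df <- data.frame(p = c(1, NA, 3), q = c("a", "b", NA))
complete.cases(df)  # TRUE FALSE FALSE
na.omit(df)         # only row 1 remains
```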
(3 - 2.9)
## [1] 0.1
(3 - 2.9) <= 0.1
## [1] FALSE
Why does R tell us that 3 - 2.9 \(\neq\) 0.1?
(3 - 2.9) - .1
## [1] 8.326673e-17
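The tiny difference arises because 2.9 and 0.1 cannot be represented exactly as binary floating point numbers. One common way to deal with this, sketched here, is to compare numbers up to a small tolerance instead of exactly.

```r
x <- 3 - 2.9

x == 0.1                   # FALSE: exact comparison of floating point numbers
isTRUE(all.equal(x, 0.1))  # TRUE: comparison up to a small tolerance
abs(x - 0.1) < 1e-8        # TRUE: the same idea written out by hand
```

all.equal() returns a description of the difference rather than FALSE when the values differ, which is why it is wrapped in isTRUE().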
Use comments (text preceded by #) to clarify what you are doing
Work with R-scripts
Work with RStudio projects
There are several ‘layers’ in R. Some layers you are allowed to fiddle around in, some are forbidden. In general there is the following distinction:
The global environment can be seen as an Olympic-size swimming pool. Everything you do has its place there.
If you’d like, you may create another, separate environment to work in.
If you create a function, it is positioned in the global environment.
Everything that happens in a function, stays in a function. Unless you specifically tell the function to share the information with the global environment.
See functions as a shampoo bottle in a swimming pool to which you add some water. If you’d like to see the color of the mixture, you’d have to squeeze the bottle for it to come out.
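The shampoo bottle can be illustrated with a small sketch. The function AddOne and the object inside.value are hypothetical names used only for this example.

```r
AddOne <- function(x) {
  inside.value <- x + 1  # inside.value exists only inside the function
  inside.value           # the last evaluated expression is returned
}

y <- AddOne(10)          # 'squeezing the bottle': store what the function returns
y                        # 11
exists("inside.value")   # FALSE: the object never reached the calling environment
```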
Packages have their own space.
There are two ways to load a package in R
library(stats)
and
require(stats)
require() will produce a warning when a package is not found. In other words, it will not stop as function library() does.
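The difference can be seen with a package that is not installed. In this sketch the package name is made up and assumed not to exist on your system.

```r
pkg <- "aPackageThatDoesNotExist"  # hypothetical name, assumed not installed

# require() returns FALSE (and warns) instead of stopping
ok <- suppressWarnings(require(pkg, character.only = TRUE))
ok

# library() throws an error, which is caught here to show the difference
res <- tryCatch(
  library(pkg, character.only = TRUE),
  error = function(e) "library() stopped with an error"
)
res
```

Because require() returns TRUE or FALSE, it is sometimes used inside if() statements, whereas library() is the safer choice at the top of a script: a missing package then stops the script immediately.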
The easiest way to install e.g. package mice is to use
install.packages("mice")
Alternatively, you can also do it in RStudio through
Tools --> Install Packages
R in depth
A workspace contains all changes you made to environments, functions and namespaces.
A saved workspace contains everything at the time of the state wherein it was saved.
You do not need to run all the previous code again if you would like to continue working at a later time.
Workspaces are compressed and require relatively little disk space when stored. The compression is very efficient and beats reloading large datasets from raw text.
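Saving and restoring can also be done from code. The following is a minimal sketch that stores a single object in a temporary file; the names ws.file and scores are illustrative.

```r
ws.file <- tempfile(fileext = ".RData")  # temporary file, for illustration only

scores <- c(4, 8, 15)
save(scores, file = ws.file)  # save.image(ws.file) would store the whole workspace

rm(scores)                    # remove the object ...
load(ws.file)                 # ... and restore it from the saved file
scores
```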
R by default saves (part of) the code history and RStudio expands this functionality greatly.
Most often it may be useful to look back at the code history for various reasons.
There are multiple ways to access the code history.
To model objects based on other objects, we use ~ (tilde)
- For example, to model body mass index (BMI) on weight, we would type
BMI ~ weight
Tilde is used to separate the left- and right-hand sides in a model formula.
For functions within models we use I(). For example, to model body mass index (BMI) on its deterministic function of weight and height, we would type
BMI ~ I(weight / height^2)
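Why I() is needed can be seen by comparing two formulas. This sketch uses the mtcars data that ships with R rather than the course data, and the object names fit.two and fit.sum are illustrative.

```r
# Without I(), + adds a second predictor: two slopes are estimated
fit.two <- lm(mpg ~ wt + hp, data = mtcars)
length(coef(fit.two))  # 3: intercept, wt and hp

# With I(), wt + hp is computed first and enters the model as one predictor
fit.sum <- lm(mpg ~ I(wt + hp), data = mtcars)
length(coef(fit.sum))  # 2: intercept and the sum
```

Inside a formula, operators such as +, * and ^ have a model-building meaning; I() tells R to treat them as ordinary arithmetic.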
Remember the boys data from package mice:
lm(bmi ~ wgt, data = boys)
## 
## Call:
## lm(formula = bmi ~ wgt, data = boys)
## 
## Coefficients:
## (Intercept)          wgt  
##     14.5401       0.0935  
Remember the boys data from package mice:
lm(bmi ~ I(wgt / (hgt / 100)^2), data = boys)
## 
## Call:
## lm(formula = bmi ~ I(wgt/(hgt/100)^2), data = boys)
## 
## Coefficients:
##        (Intercept)  I(wgt/(hgt/100)^2)  
##          -0.005553            1.000034  
It is ‘nicer’ to store the output from the function in an object. The convention for regression models is an object called fit.
fit <- lm(bmi ~ I(wgt / (hgt / 100)^2), data = boys)
The object fit contains a lot more than just the regression weights. To inspect what is inside you can use
ls(fit)
##  [1] "assign"        "call"          "coefficients"  "df.residual"  
##  [5] "effects"       "fitted.values" "model"         "na.action"    
##  [9] "qr"            "rank"          "residuals"     "terms"        
## [13] "xlevels"
Another approach to inspecting the contents of fit is the function attributes()
attributes(fit)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "lm"
The benefit of using attributes() is that it directly tells you the class of the object.
class(fit)
## [1] "lm"
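Once a model is stored in an object, its parts can be extracted. The sketch below uses the mtcars data that ships with R, so that it runs without package mice; fit.cars is an illustrative name.

```r
fit.cars <- lm(mpg ~ wt, data = mtcars)

fit.cars$coefficients        # direct access to an element of the object
coef(fit.cars)               # extractor function, same values
head(residuals(fit.cars))    # first few residuals
summary(fit.cars)$r.squared  # proportion of explained variance
```

Extractor functions such as coef() and residuals() are generally preferable to $, because they keep working when the internal structure of the object differs between model classes.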
Classes are used for an object-oriented style of programming. This means that you can write a specific function that
- has fixed requirements with respect to the input.
- presents output or graphs in a predefined manner.
When a generic function fun is applied to an object with class attribute c("first", "second"), the system searches for a function called fun.first and, if it finds it, applies it to the object.
If no such function is found, a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used (if it exists). If there is no class attribute, the implicit class is tried, then the default method.
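The search order described above can be sketched with a made-up generic. Describe and its methods are hypothetical and exist only for this illustration.

```r
# A hypothetical generic function with two methods
Describe <- function(x, ...) UseMethod("Describe")
Describe.default <- function(x, ...) "default method"
Describe.second <- function(x, ...) "method for class 'second'"

obj <- structure(list(), class = c("first", "second"))

Describe(obj)  # no Describe.first exists, so Describe.second is used
Describe(1:3)  # no class-specific method matches, so Describe.default is used
```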
plot(bmi ~ wgt, data = boys)
plot(lm(bmi ~ wgt, data = boys), which = 1)
plot(lm(bmi ~ wgt, data = boys), which = 2)
plot(lm(bmi ~ wgt, data = boys), which = 3)
plot(lm(bmi ~ wgt, data = boys), which = 4)
plot(lm(bmi ~ wgt, data = boys), which = 5)
plot(lm(bmi ~ wgt, data = boys), which = 6)
The function plot() is called, but not used directly. Instead, because the linear model has class "lm", R searches for the function plot.lm().
If function plot.lm() did not exist, R would fall back on plot.default() (which would have failed in this case because it requires x and y as input).
plot.lm() was created by John Maindonald and Martin Maechler. They thought it would be useful to have a standard plotting environment for objects with class "lm".
Since the elements that an object of class "lm" contains are known, writing a class-specific method for a generic function is straightforward.
R-coding
File names should end in .R and, of course, be meaningful.
GOOD:
predict_ad_revenue.R
BAD:
foo.R
Don’t use underscores ( _ ) or hyphens ( - ) in identifiers. Identifiers should be named according to the following conventions.
variable.name is preferred, variableName is accepted
GOOD: avg.clicks
OK: avgClicks
BAD: avg_Clicks
Function names:
GOOD: CalculateAvgClicks
BAD: calculate_avg_clicks , calculateAvgClicks
Constants: kConstantName
The maximum line length is 80 characters.
# This is to demonstrate that at about eighty characters you would move off of the page
# Also, if you have a very wide function
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
# it would be nice to pose it as
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt
          + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
# or
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg
          + bmi * hgt
          + bmi * wgt
          + wgt * hgt
          + wgt * hgt * bmi,
          data = boys)
When indenting your code, use two spaces. RStudio does this for you!
Never use tabs or mix tabs and spaces.
Exception: When a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis.
Place spaces around all binary operators (=, +, -, <-, etc.).
Exception: Spaces around =’s are optional when passing parameters in a function call.
lm(age ~ bmi, data=boys)
or
lm(age ~ bmi, data = boys)
Do not place a space before a comma, but always place one after a comma.
GOOD:
tab.prior <- table(df[df$days.from.opt < 0, "campaign.id"])
total <- sum(x[, 1])
total <- sum(x[1, ])
BAD:
# Needs spaces around '<'
tab.prior <- table(df[df$days.from.opt<0, "campaign.id"])
# Needs a space after the comma
tab.prior <- table(df[df$days.from.opt < 0,"campaign.id"])
# Needs a space before <-
tab.prior<- table(df[df$days.from.opt < 0, "campaign.id"])
# Needs spaces around <-
tab.prior<-table(df[df$days.from.opt < 0, "campaign.id"])
# Needs a space after the comma
total <- sum(x[,1])
# Needs a space after the comma, not before
total <- sum(x[ ,1])
Place a space before left parenthesis, except in a function call.
GOOD:
if (debug)
BAD:
if(debug)
Extra spacing (i.e., more than one space in a row) is okay if it improves alignment of equals signs or arrows (<-).
plot(x    = x.coord,
     y    = data.mat[, MakeColName(metric, ptiles[1], "roiOpt")],
     ylim = ylim,
     xlab = "dates",
     ylab = metric,
     main = (paste(metric, " for 3 samples ", sep = "")))
Do not place spaces around code in parentheses or square brackets.
Exception: Always place a space after a comma.
GOOD:
if (debug)
x[1, ]
BAD:
if ( debug ) # No spaces around debug
x[1,] # Needs a space after the comma
Use common sense and BE CONSISTENT.
If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers out of their rhythm when they go to read it. Try to avoid this.